To investigate the datasets from OECD.Stat, we will now scrape the website for data. We scrape the content of the website and extract the HTML table through the <table> tag.
whole_table <- url_html %>% #we assign the data to a new object
html_nodes("table") %>%
html_table(fill = TRUE) #str(whole_table) turns out to be a list
If you run a head() on the resulting whole_table, you will see that it is a list with unnamed elements marked by numbers in double brackets, such as [[1]].
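As a small sketch of what such a list looks like, the following builds a made-up list of two tables (`toy_tables` is our own stand-in for `whole_table`, not the scraped data) and checks how many tables it holds and how large each one is. Looking at the dimensions is one possible way to spot which element contains the actual data.

```r
# toy_tables is a hypothetical stand-in for whole_table
toy_tables <- list(
  data.frame(Country = c("Denmark", "Sweden"), Weeks = c("18", "15.6")),
  data.frame(X1 = "Loading...")
)

length(toy_tables)                        # number of tables in the list
table_dims <- t(sapply(toy_tables, dim))  # one row per table: rows, columns
table_dims

# pick the table with the most cells and look at it
biggest <- which.max(table_dims[, 1] * table_dims[, 2])
main_table <- toy_tables[[biggest]]
head(main_table)
```

On the real page you would run the same checks on `whole_table` itself.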
new_table <- do.call(cbind,unlist(whole_table, recursive = FALSE))
## Warning in (function (..., deparse.level = 1) : number of rows of result is not
## a multiple of vector length (arg 33)
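The warning appears because cbind() is given columns of different lengths and recycles the shorter ones to fill the gaps. The sketch below reproduces this with two made-up vectors (`long_col` and `short_col` are our own names); it only illustrates the recycling and is not part of the scraping itself.

```r
# two made-up columns of different length
long_col  <- as.character(1:6)
short_col <- c("a", "b", "c", "d")

# cbind() recycles short_col and warns because 6 is not a multiple of 4
combined <- withCallingHandlers(
  cbind(long_col, short_col),
  warning = function(w) {
    message("cbind() warned: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
combined
```

This is why the combined table contains repeated values such as the "Loading..." cells: they come from short helper tables on the page that were recycled to the length of the main table.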
head(new_table)
## Indicator Indicator Indicator
## [1,] "Age Group" "Age Group" "Age Group"
## [2,] "Unit" "Unit" "Unit"
## [3,] "Sex" "Sex" "Sex"
## [4,] "Time" "Time" "Time"
## [5,] "Time" "Time" "Time"
## [6,] "Time" "Time" "Time"
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1990"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1991"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1992"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1993"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1994"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1995"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1996"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1997"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1998"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "1999"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2000"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2001"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2002"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2003"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2004"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2005"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2006"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2007"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2008"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2009"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2010"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2011"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2012"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2013"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2014"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2015"
## [5,] ""
## [6,] NA
## Length of maternity leaveLength of parental leave with job protectionTotal length of paid maternity and parental leaveLength of paid father-specific leave
## [1,] "Total"
## [2,] "Weeks"
## [3,] "Women"
## [4,] "2016"
## [5,] ""
## [6,] NA
## X1 X2
## [1,] "Data extracted on 07 Dec 2020 16:30 UTC (GMT) from OECD.Stat" NA
## [2,] "Data extracted on 07 Dec 2020 16:30 UTC (GMT) from OECD.Stat" NA
## [3,] "Data extracted on 07 Dec 2020 16:30 UTC (GMT) from OECD.Stat" NA
## [4,] "Data extracted on 07 Dec 2020 16:30 UTC (GMT) from OECD.Stat" NA
## [5,] "Data extracted on 07 Dec 2020 16:30 UTC (GMT) from OECD.Stat" NA
## [6,] "Data extracted on 07 Dec 2020 16:30 UTC (GMT) from OECD.Stat" NA
## X1 X2 X3 X1 X2
## [1,] "Loading..." "Loading..." NA "Loading..." NA
## [2,] "Loading..." "" NA "Loading..." NA
## [3,] "Loading..." "Loading..." NA "Loading..." NA
## [4,] "Loading..." "" NA "Loading..." NA
## [5,] "Loading..." "Loading..." NA "Loading..." NA
## [6,] "Loading..." "" NA "Loading..." NA
The lines above take the whole table from the website with all rows and columns. The result is very difficult to read: it does not look nice and it shows things that we are not interested in. Therefore we tidy the data in the following section of this guide.
Tidying the data
In our case we only want to look at the countries’ average number of weeks of maternity leave over the time period 1990 to 2016. Therefore we filter the data as shown in the following lines:
new_table <- as.data.frame(new_table) #create a new table
mleave <- new_table[8:44,2:30] #choose the rows and columns we want to look at
mleave <- mleave[,c(1,3:29)] #choose the columns we want to both eliminate and look at
colnames(mleave) <- c("country", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016") #assign new column names to years
mleave <- as.data.frame(lapply(mleave, function(y) gsub("\\.\\.", NA, y))) #change the .. to NA, so R can read the data
In the table the numbers are stored as characters. In the following I change these to numeric, so that R can read the table correctly.
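To see what the replacement does, here is a minimal sketch on three made-up values (`raw` is our own example vector). It compares the gsub() approach with an alternative that only replaces cells that are exactly "..", which may be a little safer if a cell could contain two dots as part of other text.

```r
raw <- c("18", "..", "17.5")   # made-up values, ".." marks a missing observation

# same idea as the gsub() call above: matching elements become NA
via_gsub <- gsub("\\.\\.", NA, raw)

# alternative: only replace cells that are exactly ".."
via_match <- raw
via_match[via_match == ".."] <- NA

via_gsub
via_match
```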
cols.num <- c("X1990", "X1991", "X1992", "X1993", "X1994", "X1995", "X1996", "X1997", "X1998", "X1999", "X2000", "X2001", "X2002", "X2003", "X2004", "X2005", "X2006", "X2007", "X2008", "X2009", "X2010", "X2011", "X2012", "X2013", "X2014", "X2015", "X2016") #columns from characters to numeric
mleave[cols.num] <- sapply(mleave[cols.num],as.numeric) #recode characters as numeric
sapply(mleave, class) #print classes of all columns
## country X1990 X1991 X1992 X1993 X1994
## "character" "numeric" "numeric" "numeric" "numeric" "numeric"
## X1995 X1996 X1997 X1998 X1999 X2000
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X2001 X2002 X2003 X2004 X2005 X2006
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X2007 X2008 X2009 X2010 X2011 X2012
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X2013 X2014 X2015 X2016
## "numeric" "numeric" "numeric" "numeric"
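You may wonder why the year columns are now called X1990, X1991 and so on. Most likely this happens because as.data.frame() makes column names syntactically valid, and a name cannot start with a digit. The sketch below shows the same renaming with make.names() on a made-up vector of names (`wanted` is our own example), and one way to undo it.

```r
wanted <- c("country", "1990", "1991")

fixed <- make.names(wanted)   # syntactic versions of the names
fixed

# strip the X again, but only when it is followed by a four-digit year
restored <- sub("^X([0-9]{4})$", "\\1", fixed)
restored
```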
In the table the countries are set to be characters. In the following I change these to factors.
mleave$country <- as.factor(mleave$country) #columns from characters to factor
In the following I restructure the table so it knows that every country has numbers from different years.
mleave_long <- mleave %>% gather(year, weeks_mleave, -country) #restructuring the table so it knows that every country has numbers from different years
Now the table is readable for R.
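In newer versions of tidyr, gather() is superseded by pivot_longer(). As a hedged alternative, the sketch below reshapes a small made-up table (`wide` is our own example with the same layout as mleave) and removes the leading X from the year in the same step.

```r
library(tidyr)

# made-up wide table with the same layout as mleave
wide <- data.frame(
  country = c("Denmark", "Norway"),
  X2011 = c(18, 9),
  X2012 = c(18, 9)
)

long <- pivot_longer(
  wide,
  cols = -country,
  names_to = "year",
  values_to = "weeks_mleave",
  names_prefix = "X"          # drops the leading X from the year
)
long
```

Note that pivot_longer() orders the rows country by country, while gather() orders them year by year; the content is the same.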
VISUALIZATIONS
Now that we have finished tidying the table from OECD.Stat, we can begin making visualizations of this independent data set. In order to do so we need a few more packages.
library(gifski) #to convert images to GIF animations
library(png) #to read, write and display the PNG format
library(scales) #to manipulate and polish visualizations
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(ggthemes) #to make nice visualizations
library(gganimate) #to make animated visualizations
There are several ways of visualizing data. You could, for example, look up line geoms, point geoms or bar geoms. One way of visualizing the data is the following.
mleave_long %>%
ggplot(aes(x = year, y = weeks_mleave))+
geom_jitter(aes(color = country))
## Warning: Removed 170 rows containing missing values (geom_point).
Another way of visualizing the data can be done like this:
ggplot(mleave_long) +
geom_line(mapping = aes(x = year, y = weeks_mleave, group = country, color = country))
## Warning: Removed 170 row(s) containing missing values (geom_path).
As we can see above, many countries are included in the visualization. We want to make our visualizations clearer and more specific by focusing on the countries in the Nordic Region: Denmark, Iceland, Sweden, Norway and Finland. We chose to compare these countries, as they have similar Scandinavian welfare models.
mnordic <- mleave_long %>% filter(country %in% c("Denmark","Iceland", "Sweden","Norway", "Finland")) #assigning the nordic countries
mnordic$year <- str_remove(mnordic$year, "X") #filtering out the "X" that popped in to our model earlier
Now that we have created this object with the specific countries, we will use it to visualize the differences between the countries. We will put it in the earlier ggplot visualization and add a title.
ggplot(mnordic) +
geom_line(mapping = aes(x = year, y = weeks_mleave, group = country, color = country)) + ggtitle("Length of Maternity Leave in Nordic Countries") +theme_solarized()+ scale_color_solarized()+
xlab("Year")+ #naming the x axis
ylab("Weeks") #naming the y axis
Here we used the solarized theme. You can choose among many different themes from this site: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
The visualization above shows the length of maternity leave. With this script it is possible to compare the different countries and to select specific countries you want to compare.
Furthermore, since the dataset we want to compare this with only has data from 2011 to 2018, we filter the years so that our new object only includes the Nordic countries from 2011 to 2016. We assign this to ‘mnselected’.
mnselected <- mleave_long %>% filter(country %in% c("Denmark","Iceland", "Sweden","Norway", "Finland")) %>% filter(year %in% c("X2011", "X2012", "X2013", "X2014", "X2015", "X2016"))
mnselected$year <- str_remove(mnselected$year, "X") #filtering out the "X" that popped in to our model earlier
mnselected
## country year weeks_mleave
## 1 Denmark 2011 18.0
## 2 Finland 2011 17.5
## 3 Iceland 2011 13.0
## 4 Norway 2011 9.0
## 5 Sweden 2011 15.6
## 6 Denmark 2012 18.0
## 7 Finland 2012 17.5
## 8 Iceland 2012 13.0
## 9 Norway 2012 9.0
## 10 Sweden 2012 15.6
## 11 Denmark 2013 18.0
## 12 Finland 2013 17.5
## 13 Iceland 2013 13.0
## 14 Norway 2013 9.0
## 15 Sweden 2013 15.6
## 16 Denmark 2014 18.0
## 17 Finland 2014 17.5
## 18 Iceland 2014 13.0
## 19 Norway 2014 17.0
## 20 Sweden 2014 15.6
## 21 Denmark 2015 18.0
## 22 Finland 2015 17.5
## 23 Iceland 2015 13.0
## 24 Norway 2015 13.0
## 25 Sweden 2015 15.6
## 26 Denmark 2016 18.0
## 27 Finland 2016 17.5
## 28 Iceland 2016 13.0
## 29 Norway 2016 13.0
## 30 Sweden 2016 19.9
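Listing every year by hand works, but it gets tedious for longer periods. As a sketch of an alternative, the example below converts the year to a number first and then filters on a range. The table `toy_long` is made up for illustration, and subset() is used here only to keep the example free of extra packages.

```r
# made-up long table with the same columns as mleave_long
toy_long <- data.frame(
  country = rep(c("Denmark", "Germany"), each = 3),
  year = rep(c("X2010", "X2011", "X2016"), times = 2),
  weeks_mleave = c(18, 18, 18, 14, 14, 14),
  stringsAsFactors = FALSE
)

# remove the X and store the year as an integer
toy_long$year <- as.integer(sub("^X", "", toy_long$year))

nordic <- c("Denmark", "Iceland", "Sweden", "Norway", "Finland")
selected <- subset(toy_long, country %in% nordic & year >= 2011 & year <= 2016)
selected
```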
2: SCRAPING THE SECOND DATASET FROM OECD.STAT
Scraping Data
You can now choose another dataset that you would like to work with. Here we scrape the content of the website and extract the HTML table:
url <- "https://stats.oecd.org/index.aspx?queryid=96330"
# scrape the website
url_html <- read_html(url)
The following option lets us extract the whole HTML table through the <table> tag.
whole_table <- url_html %>%
html_nodes("table") %>%
html_table(fill = TRUE) #str(whole_table) turns out to be a list
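If you want to try the extraction step without going online, you can run it on a small piece of HTML that you write yourself. The sketch below assumes the rvest package is installed; the HTML string and the numbers in it are made up for illustration.

```r
library(rvest)

# a tiny, made-up HTML page with a single table
page <- read_html(
  "<html><body><table>
     <tr><th>Country</th><th>2011</th></tr>
     <tr><td>Denmark</td><td>27.6</td></tr>
     <tr><td>Iceland</td><td>39.3</td></tr>
   </table></body></html>"
)

tables <- html_table(html_nodes(page, "table"))  # a list, one element per table
first_table <- tables[[1]]
first_table
```

Just like with the real page, the result is a list, so you pick out a table with double brackets.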
If you run a head() on the resulting whole_table, you will see that it is a list with unnamed elements marked by numbers in double brackets, such as [[1]].
new_table <- do.call(cbind,unlist(whole_table, recursive = FALSE))
head(new_table)
## Indicator Indicator Indicator EMP10NEW: Share of female managers
## [1,] "Sex" "Sex" "Sex" "Women"
## [2,] "Age Group" "Age Group" "Age Group" "Total"
## [3,] "Unit" "Unit" "Unit" "Percentage"
## [4,] "Time" "Time" "Time" "2011"
## [5,] "Time" "Time" "Time" ""
## [6,] "Time" "Time" "Time" NA
## EMP10NEW: Share of female managers EMP10NEW: Share of female managers
## [1,] "Women" "Women"
## [2,] "Total" "Total"
## [3,] "Percentage" "Percentage"
## [4,] "2012" "2013"
## [5,] "" ""
## [6,] NA NA
## EMP10NEW: Share of female managers EMP10NEW: Share of female managers
## [1,] "Women" "Women"
## [2,] "Total" "Total"
## [3,] "Percentage" "Percentage"
## [4,] "2014" "2015"
## [5,] "" ""
## [6,] NA NA
## EMP10NEW: Share of female managers EMP10NEW: Share of female managers
## [1,] "Women" "Women"
## [2,] "Total" "Total"
## [3,] "Percentage" "Percentage"
## [4,] "2016" "2017"
## [5,] "" ""
## [6,] NA NA
## EMP10NEW: Share of female managers
## [1,] "Women"
## [2,] "Total"
## [3,] "Percentage"
## [4,] "2018"
## [5,] ""
## [6,] NA
## X1 X2
## [1,] "Data extracted on 07 Dec 2020 17:05 UTC (GMT) from OECD.Stat" NA
## [2,] "Data extracted on 07 Dec 2020 17:05 UTC (GMT) from OECD.Stat" NA
## [3,] "Data extracted on 07 Dec 2020 17:05 UTC (GMT) from OECD.Stat" NA
## [4,] "Data extracted on 07 Dec 2020 17:05 UTC (GMT) from OECD.Stat" NA
## [5,] "Data extracted on 07 Dec 2020 17:05 UTC (GMT) from OECD.Stat" NA
## [6,] "Data extracted on 07 Dec 2020 17:05 UTC (GMT) from OECD.Stat" NA
## X1 X2 X3 X1 X2
## [1,] "Loading..." "Loading..." NA "Loading..." NA
## [2,] "Loading..." "" NA "Loading..." NA
## [3,] "Loading..." "Loading..." NA "Loading..." NA
## [4,] "Loading..." "" NA "Loading..." NA
## [5,] "Loading..." "Loading..." NA "Loading..." NA
## [6,] "Loading..." "" NA "Loading..." NA
The lines above take the whole table from the website with all rows and columns. It does not look nice and it shows things that I am not interested in.
Tidying the data
In our case we only want to look at the countries’ share of female managers over the time period 2011 to 2018. Therefore we filter the data as shown in the following lines:
fmanagers <- new_table[8:52,2:18] #choose the rows and columns we want to look at
fmanagers <- fmanagers[,c(1,3:10)] #choose the columns we want to both eliminate and look at
colnames(fmanagers) <- c("country", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018") #assign new column names to years
I can see that the data is listed as characters, and I want to convert some of it to numeric to be able to create visualizations (I got help from this website: https://stackoverflow.com/questions/22772279/converting-multiple-columns-from-character-to-numeric-format-in-r)
Now we convert our table to a tibble with as_tibble(). This turns the matrix into a data frame, which will help our further analysis.
fmanagers <- as_tibble(fmanagers) #as.tibble() is deprecated as of tibble 2.0.0, so we use as_tibble()
cols.num <- c("2011","2012","2013","2014","2015","2016","2017","2018") # we tell R which columns should be numeric
fmanagers[cols.num] <- sapply(fmanagers[cols.num],as.numeric) # here we convert the columns in our dataset to numeric instead of character
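A slightly more defensive variant of this conversion uses lapply(), which always returns a list of columns, whereas sapply() may simplify its result to a matrix. The sketch below runs on a made-up table (`toy_managers` is our own example) and picks the year columns by excluding the country column instead of typing every year.

```r
# made-up table with character columns, ".." marks a missing value
toy_managers <- data.frame(
  country = c("Denmark", "Iceland"),
  `2011` = c("27.6", "39.3"),
  `2012` = c("28.2", ".."),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

year_cols <- setdiff(names(toy_managers), "country")

# cells that cannot be read as a number (such as "..") become NA
toy_managers[year_cols] <- lapply(
  toy_managers[year_cols],
  function(x) suppressWarnings(as.numeric(x))
)
sapply(toy_managers, class)
```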
To be able to create the wanted visualizations, I need to be able to distinguish between the different countries. In the following I make sure that the countries are stored as characters and print the classes of all columns:
fmanagers$country <- as.character(fmanagers$country)
sapply(fmanagers, class)
## country 2011 2012 2013 2014 2015
## "character" "numeric" "numeric" "numeric" "numeric" "numeric"
## 2016 2017 2018
## "numeric" "numeric" "numeric"
In the following we restructure the table into a long format with one row per country and year.
fmanagers_long <- fmanagers %>% gather(year, percentage, -country) # I want to tidy my data further so that there will be a specific column with the year assigned. Now I can hopefully start visualizing my data...
Now we have finished tidying the data and restructuring the table. Therefore we can move on to the visualizations.
Visualizations of second data set
theme_set(theme_bw()) #set the theme to black and white for better visualization
ggplot(fmanagers_long) +
geom_line(mapping = aes(x = year, y = percentage, group = country)) #draw one line per country with the year on the x-axis and the percentage on the y-axis
Here we have all the years and all the countries listed, but the plot is not very manageable. As a first step we illustrate the countries in different colors; further below we pick out the countries we want to compare.
fmanagers_long %>%
ggplot(aes(x = year, y = percentage))+
geom_jitter(aes(color = country))

Another way of making visualizations is the following:
ggplot(fmanagers_long) +
geom_line(mapping = aes(x = year, y = percentage, group = country, color = country))
As there are so many countries in the data set, the visualization is not very nice. Therefore we now want to select some countries. We choose five countries (Denmark, Iceland, Finland, Norway and Sweden) to see how the share of female managers develops in these countries. Therefore we create a new object called ‘fnordic’.
fnordic <- fmanagers_long %>% filter(country %in% c("Denmark","Iceland", "Sweden","Norway", "Finland")) #here you can choose the countries
fnordic
## # A tibble: 40 x 3
## country year percentage
## <chr> <chr> <dbl>
## 1 Denmark 2011 27.6
## 2 Finland 2011 31.6
## 3 Iceland 2011 39.3
## 4 Norway 2011 32.9
## 5 Sweden 2011 34.3
## 6 Denmark 2012 28.2
## 7 Finland 2012 29.4
## 8 Iceland 2012 39.7
## 9 Norway 2012 33.6
## 10 Sweden 2012 35.3
## # … with 30 more rows
Now that we have created this object with the specific countries, we will use it to visualize the differences between the countries. We will put it in the earlier ggplot visualization and add a title.
ggplot(fnordic) +
geom_line(mapping = aes(x = year, y = percentage, group = country, color = country)) + ggtitle("Share of Female Managers in Nordic Countries")+ theme_solarized()+ scale_color_solarized()+ #add title
xlab("Year")+ #to rename the x-axis
ylab("Percentage")#to rename the y-axis

Here we used the solarized theme. You can choose among many different themes from this site: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
The visualization above shows the share of female managers. With this script it is possible to compare the different countries and to select the specific countries you want to compare. As we can see above, the countries that Denmark normally compares itself with are doing a lot better than Denmark when it comes to the share of female managers. Also, Denmark (and Finland) are doing worse than the OECD average reported in the dataset.
Furthermore, since the dataset we want to compare this with has data from 1990 to 2016, we filter the years so that our new object only includes the Nordic countries from 2011 to 2016. We assign this to ‘fnselected’.
fnselected <- fmanagers_long %>% filter(country %in% c("Denmark","Iceland", "Sweden","Norway", "Finland")) %>% filter(year %in% c("2011","2012", "2013","2014", "2015", "2016")) #here you can choose the countries and the years you want to focus on
Here we visualize our newly assigned object, ‘fnselected’:
ggplot(fnselected) +
geom_line(mapping = aes(x = year, y = percentage, group = country, color = country)) + ggtitle("Share of Female Managers in Nordic Countries")+ theme_solarized()+ scale_color_solarized()+ #add title
xlab("Year")+ #to rename the x-axis
ylab("Percentage")#to rename the y-axis

To make it easier to follow, we also show the visualization of the length of maternity leave in the Nordic Region, ‘mnselected’:
ggplot(mnselected) +
geom_line(mapping = aes(x = year, y = weeks_mleave, group = country, color = country)) + ggtitle("Length of Maternity Leave in Nordic Countries") +theme_solarized()+ scale_color_solarized()+ #add a title
xlab("Year")+ #naming the x axis
ylab("Weeks") #naming the y axis
Now that we can see these two visualizations showing the development from 2011 to 2016 in the Nordic Region, we want to show how it is possible to merge the two datasets and visualizations!
MERGING THE DATASETS AND CREATING ANIMATION
We have worked with two data sets from OECD.Stat independently. In the following section we merge the two data sets and create an animation. We ask the question: How does the length of maternity leave influence the share of female managers in the Nordic countries?
First we put the data sets together. We do this with the function merge() and store the result in the object ‘merged’. Make sure that your two datasets share the key columns (here year and country) and that these are stored in the same format, so that the rows can be matched.
merged <- merge(mnordic, fnordic, by=c("year", "country"))
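To illustrate what merge() does with two key columns, here is a minimal sketch on two made-up tables (`toy_leave` and `toy_managers` are our own examples). By default merge() keeps only the combinations of year and country that appear in both tables, so years that exist in only one of them are dropped.

```r
# made-up tables sharing the key columns year and country
toy_leave <- data.frame(
  year = c("2010", "2011", "2011"),
  country = c("Denmark", "Denmark", "Norway"),
  weeks_mleave = c(18, 18, 9),
  stringsAsFactors = FALSE
)
toy_managers <- data.frame(
  year = c("2011", "2011", "2012"),
  country = c("Denmark", "Norway", "Denmark"),
  percentage = c(27.6, 32.9, 28.2),
  stringsAsFactors = FALSE
)

# only rows whose year AND country appear in both tables are kept
toy_merged <- merge(toy_leave, toy_managers, by = c("year", "country"))
toy_merged
```

Here only the two rows for 2011 survive, because 2010 and 2012 each appear in just one table.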
Then we will try to make visualizations on this. At first, we make the comparisons within a certain year, for instance 2016.
# I will try to look at the merged statistics for a specific year
merged2016 <- filter(merged, year == "2016") #we can compare the different countries on maternity leave and share of female managers in 2016
# Here, we will visualize the status in 2016 on the length of maternity leave and share of female managers in the Nordic Region. We call this visualization 'anim1', but you could also call it something else
anim1 <- ggplot(merged2016,aes(x = weeks_mleave, y = percentage, group = country, color = country))+
geom_point(size=8, alpha = 0.5)+
xlim(10,24)+
ylim(25,41)+
xlab("Weeks")+
ylab("Percentage")+
ggtitle("MATERNITY LEAVES' INFLUENCE ON SHARE OF FEMALE MANAGERS")+
geom_text(aes(label=country),hjust=0, vjust=2.5)+theme_solarized()+ scale_color_solarized()+
theme(legend.position = "none")
anim1

Now we want to make a moving visualization with changing years.
class(merged$year) #the class of year is character, so we need to change it
## [1] "character"
merged$year <- as.numeric(as.character(merged$year)) #we change 'year' to numeric, so the visualization will be able to move through the years as numbers
anim2 <-ggplot(merged, aes(x = weeks_mleave, y = percentage, group = country, color = country)) +
geom_point(size=8, alpha = 0.4)+
transition_time(year)+
xlab("Weeks of maternity leave")+
ylab("Percentage of female managers")+
labs(title = "Year: {round(frame_time)}")+
theme_solarized()
anim2 #we call this visualization anim2, but it could also have a different name

Now we have made the data move! As we can see, Denmark has the longest maternity leave for most of the period (Sweden overtakes it in 2016) and the lowest share of female managers. However, it does not look as if there is a direct correlation between the length of maternity leave and the share of female managers.
If you want to use this animation in PowerPoint or other kinds of presentations, it is a good idea to save it as a GIF. Below is a guide on that:
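If you want a number to go with the visual impression, one simple option is the correlation coefficient. The sketch below is only an illustration: it uses the ten rows for 2011 and 2012 that are printed in the tables earlier in this guide, typed in by hand, so it does not cover the whole period and it says nothing about cause and effect.

```r
# the ten rows for 2011 and 2012 printed in the tables above
# (order: Denmark, Finland, Iceland, Norway, Sweden, for each year)
weeks      <- rep(c(18.0, 17.5, 13.0, 9.0, 15.6), times = 2)
percentage <- c(27.6, 31.6, 39.3, 32.9, 34.3,
                28.2, 29.4, 39.7, 33.6, 35.3)

r <- cor(weeks, percentage)   # Pearson correlation, between -1 and 1
r
```

With the full data you could run the same call on the two columns of the merged table instead.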
#Make it a GIF
animate(anim2, fps = 10, width = 750, height = 450) #render the animation with 10 frames per second and a size of 750 x 450 pixels

anim_save("gender.gif")
WRAP IT UP
Now we have learned how to scrape, code, tidy and visualize data from OECD.Stat. Good job! We hope you can use this in your future research. If you want to contact us, our GitHubs are: “helenefonne” and “sofia-sif”.